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A SEMI-AUTOMATIC METHOD TO GUIDE THE CHOICE 
OF RIDGE PARAMETER IN RIDGE REGRESSION 

By Erika Cule* and Maria Be iorio"'' 

psj . Imperial College London* and University College London ^ 

' We consider the application of a popular penalised regression 

, method, Ridge Regression, to data with very high dimensions and 

■ many more covariates than observations. Our motivation is the prob- 

lem of out-of-sample prediction and the setting is high-density geno- 

, type data from a genome-wide association or resequencing study. 

' Ridge regression has previously been shown to offer improved per- 

I formance for prediction when compared with other penalised regres- 
sion methods. One problem with ridge regression is the choice of 

I I ' an appropriate parameter for controlling the amount of shrinkage of 

Qh I the coefficient estimates. Here we propose a method for choosing the 

■ ridge parameter based on controlling the variance of the predicted 
I t ' observations in the model. 

, Using simulated data, we demonstrate that our method outper- 

' forms subset selection based on univariate tests of association and an- 

' ' ' other penalised regression method, HyperLasso regression, in terms 

, of improved prediction error. We extend our approach to regression 

problems when the outcomes are binary (representing cases and con- 
trols, as is typically the setting for genome-wide association studies) 
. and demonstrate the method on a real data example consisting of 

\^ ' case-control and genotype data on Bipolar Disorder, taken from the 

, Wellcome Trust Case Control Consortium and the Genetic Associa- 

■ tion Information Network. 

o '■ 
CN ■ 

' 1. Introduction. We are interested in the problem of out-of-sample 

^ . prediction in a high-dimensional regression setting, when data consist of 

■'-j I many more predictor variables than observations. Our motivation is the 

' prediction of a phenotype (observable characteristic) of interest using an 

. individual's genetic information, and possibly other covariates. In this in- 

troduction we describe the problem of phenotype prediction using genetic 
data, and we outline the specific issue, selection of the ridge parameter in 
Ridge Regression (RR, Hoerl and Kennard (1970)), that we address in this 
paper. 

1.1. Cenetic Risk Prediction. The data that motivate this paper arise in 
contemporary research into the relationship between genetic and phenotyp- 
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ical variation. Modern Genome- Wide Association Studies (GWAS) investi- 
gate a phenotype of interest, which may be a continuous trait (such as blood 
pressure) or (more commonly in GWAS) a binary trait describing a disease 
case (1) or disease-free control (0). The corresponding genetic data consist of 
typed genetic variants (normally single nucleotide polymorphisms (SNPs)) 
for each individual. Current technology allows the typing of thousands to 
hundreds of thousands of variants directly, and this number may be aug- 
mented to millions through imputation. These genetic variants are densely 
spaced along the genome and are highly correlated due to the phenomenon 
of linkage disequilibrium (LD), the co- inheritance of nearby genetic variants 
that results in their frequencies deviating from those that would be expected 
if each SNP was inherited independently. 

The phenotypes can be viewed as the outcome variable in a regression 
setting, with the genetic variants as the predictors. As genetic variants are 
fixed at birth, but phenotypes may not present until later in life, there 
is the potential for genetic data to be used to predict future phenotype. 
In a clinical setting, such a prediction could suggest lifestyle modifications 
or pharmaceutical interventions that would prevent or delay the onset of 
disease. Then, our aim is to construct a model to predict the phenotype of 
interest using the genotype data from individuals for whom the (potentially 
future) disease state is unknown. 

RR [Hoerl and Kennard (1970)] is a means of estimating regression co- 
efficients when data are high-dimensional and/or contain correlated vari- 
ables. This is often the case in genetic data, as described above. RR can 
be used to obtain stable parameter estimates when standard multiple re- 
gression approaches would fail. Among a number of regularized regression 
methods, RR was shown to perform best in terms of predictive performance 
[Prank and Friedman (1993)]. By making use of an orthogonal transforma- 
tion of the high-dimensional data, RR coefficients can be computed in a 
manner that is computationally efficient. In this paper we demonstrate that 
RR has good predictive performance in high-dimensional settings. In addi- 
tion, we propose a method for choosing the penalty parameter when there 
are more predictors than observations. 

Our focus here is on prediction. Much of the analysis of genetic data to 
date has focused on the identification of causal variants. Using GWAS data 
with case-control phenotypes, statistical differences in genotype at a certain 
genetic locus in populations of cases and matched controls is seen as indica- 
tive of association. A typical approach is a "single-SNP" analysis, in which 
each SNP is tested individually for association with the outcome variable. 
Due to the large number of tests, a stringent threshold for significance is 
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required. SNPs that are found to be associated with outcome at some sig- 
nificance threshold are viewed as either affecting the outcome, or hkely to 
be correlated with such a SNP. In the latter case, further SNPs near the 
associated SNP are typed in an attempt to identify the causal variant(s). 
Once identified, these genetic variants can give insight into the biological 
mechanisms of the disease and may indicate potential drug targets. 

Existing attempts to use genotype information to predict disease risk 
have taken causal variants identified in previous studies and used these to 
construct a predictive model. One approach is to base the increase in dis- 
ease risk on a cumulative count of known risk variants [Meigs et al. (2008)]. 
The model may take into account the estimated effect size of each variant 
[Karlson et al. (2011)]. To date, the improvement in prediction of disease 
risk based on genetic data over traditional risk prediction models (using 
clinical data and family history) has been too small to have clinical utility 
[Talmud et al. (2010)]. Whilst genome-wide association studies have collec- 
tively identified thousands of genetic variants associated with diverse phe- 
notypes [Hindorff et al. (2009)], overall, effect sizes of identified variants are 
small and much heritability remains unexplained [Eichler et al. (2010)]. Sug- 
gested sources of this so-called "missing heritability" include variants whose 
effect size does not meet the stringent significance threshold required in uni- 
variate tests, rare variants of large effect which current studies are under- 
powered to detect, or the combined effect of multiple SNPs whose individual 
effects do not reach univariate significance. 

The single-SNP tests described above do not take into account the com- 
bined effect of multiple SNPs. However, ordinary multiple regression cannot 
be applied to genetic data due to the high dimensionality of the data and 
the correlation among SNPs. Previous work on the analysis of all SNPs si- 
multaneously [Ayers and Cordell (2010); Hoggart et al. (2008)] focused on 
the identification of causal variables when genotypes are in LD. The pe- 
nalised regression methods used in these studies estimate some regression 
coefficients (effect sizes of variables) to be exactly zero and remove the 
corresponding variables from the model. This results in the information 
in the removed variables being lost if the estimated effect size falls be- 
low the effective significance threshold determined by the penalty param- 
eters in the penalised regression. RR, in contrast, estimates the effect size of 
all the variables. We suggest that this will improve predictive performance 
[Kooperberg, LeBlanc and Obenchain (2010)]. The estimated effect sizes of 
non-causal variants will be centred around zero and will cancel each other 
out, whereas the true but small effects will be retained in the model. 
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1.2. The choice of shrinkage parameter in Ridge Regression. RR was 
proposed as a means of estimating regression coefficients with smaller mean- 
square error than their least squares counterparts when predictors are corre- 
lated [Hoerl and Kennard (1970)]. RR has been extended to the logistic re- 
gression model [Lee and Silvapulle (1988); Cessie and Houwelingen (1992)] 
and has been applied to genetic data when predictor variables are in high LD 
with the aim of improving identification of causal SNPs [Malo, Libiger and Schork 
(2008)]. Cule, Vineis and De lorio (2011) proposed a test of significance of 
regression coefficients estimated using RR, facilitating the use of RR to 
guide variable selection in the analysis of genetic data. In common with 
other shrinkage methods such as the LASSO [Tibshirani (1996)] or the elas- 
tic net [Zou and Hastie (2005b)], RR requires the specification of a penalty 
parameter that controls the degree of shrinkage of the coefficient estimates. 
A number of methods have been proposed to estimate this parameter in RR 
based on the data, but no consensus method provides a universally opti- 
mum choice. Hoerl and Kennard (1970) proposed a graphical method, the 
ridge trace. Based on the test of significance of ridge regression coefficients, 
Cule, Vineis and De lorio (2011) introduced the p-value trace, a plot of the 
significance of RR coefficients with increasing shrinkage parameter. The p- 
value trace can be used to evaluate the changing significance of the regression 
coefficients with increased shrinkage parameter. However, graphical meth- 
ods rapidly become unfeasible when the number of predictors is large. Data- 
driven methods have also been proposed. Some popular choices are those of 
Hoerl, Kennard and Baldwin (1975) and Lawless and Wang (1976). How- 
ever, both of these rely on least-squares estimates of the parameters, which 
do not exist when the number of variables in the model exceeds the number 
of observations. 

Cross-validation is another approach that has been proposed to choose the 
ridge parameter [Golub, Heath and Wahba (1979)]. However, this has the 
disadvantage of being computationally intensive. Techniques corresponding 
to those proposed for linear RR have been used in logistic RR [Vago and Kemeny 
(2006)]. A simulation study investigated the performance of various logistic 
ridge estimators; however, the simulations only consider regression prob- 
lems with fewer predictors than observations, and do not conclude that a 
single ridge estimator is best in terms of reducing mean squared error in all 
regression situations. 

In this paper, we propose a semi-automatic method to guide the choice of 
ridge parameter in RR. Our proposed method is valid when the regression 
problem comprises more predictors than observations. The method proposed 
here is motivated by two observations: 
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The first observation is that the ridge estimator proposed by Hoerl, Kennard and Baldwin 
(1975) is based on the "ideal" ridge estimator. This "ideal" estimator is 
defined through the true (but unknown) regression coefficients, and is de- 
rived by first transforming the model into canonical form. In their ridge 
estimator, Hoerl, Kennard and Baldwin (1975) replace the unknown regres- 
sion coefficients in the "ideal" estimator with their least squares counter- 
parts. In data, such as genetic data, with more predictors than observa- 
tions, these least squares estimates do not exist. However, due to the cor- 
relation structure in genetic data, it is likely that most of the variance in 
the predictors is explained by the first few components when the model 
is transformed to canonical form. By transforming the model in this way 
and using only the first few principal components (PCs) in the estimator 
of Hoerl, Kennard and Baldwin (1975), we can feasibly construct a ridge 
parameter based on the meaningful component directions in a genetic data 
set. 

The second motivation is arrived at by considering the relationship be- 
tween RR and principal components regression (PGR, Hastie, Tibshirani and Friedman 
(2009)). RR coefficients can be derived by shrinking the coefficients from 
a PGR including all the components, in proportion to their correspond- 
ing eigenvalues. Goefficients corresponding to components which explain a 
smaller proportion of variance in the data are shrunk more. Hastie and Tibshirani 
(1990) define the degrees of freedom for variance of a penalised regression 
fit. In PGR, the degrees of freedom for variance are equal to the number of 
components retained in the model; in RR, the degrees of freedom for vari- 
ance depend on the ridge parameter. By restricting the degrees of freedom 
for variance in a ridge regression fit to be the same as for a corresponding 
PGR we can develop a model with the same prediction variance as a PGR. 
In situations that commonly arise in GWAS, this results in a model with 
smaller bias than the corresponding PGR. This is because whilst PGR dis- 
cards the information in the components not used in the model, RR 'spreads 
out' the shrinkage across all components, thus making more efficient use of 
the degrees of freedom for variance. The components that explain less of the 
variance in the data are penalised more. The result is a model with smaller 
average prediction error. 

The remainder of this paper is organised as follows: In Section 2 we intro- 
duce RR and derive our proposed ridge estimator. In Section 3 we illustrate 
aspects of our approach using simulation studies. In Section 4 we present an 
evaluation of our method using data on Bipolar Disorder from the Wellcome 
TYust Gase-Gontrol Gonsortium (WTGGG) [WTGGG (2007)] to construct a 
prediction model. This model is then evaluated using data from the Genetic 
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Association Information Network (GAIN) [Smith et al. (2009)]. In Section 
5 we discuss some computational issues that arise when fitting a ridge re- 
gression model to very high-dimensional data. 

2. Ridge estimator. 

2.1. Regression. Consider a regression problem in which each data point 
(Yi,Xi), i = I, . . . ,n, comprises a response variable Yi and a corresponding 
vector of p predictors, Xj = (xi, . . . , Xp). In the linear regression model, the 
relationship between n such responses and predictors is given by: 

(2.1) Y = X/3 + e 

Here, Y = {Y\, . . . ,Yn) and X is an n x p matrix with rows Xj. /3 = 
(/3i, . . . , j3p) is a vector of regression coefficients, and e = (ei, . . . , e^) is a vec- 
tor of independent and identically distributed normally distributed random 
errors, E (e^) = and E (e?) = o"^. When X is of full rank, the least-squares 
estimators of the p parameters /3i . . . (3p are uniquely estimated by 

(2.2) /3 = (X'X)~^X'Y 

2.2. Regularised regression. Ordinary least squares (OLS) estimates of /3 
are often not appropriate in the analysis of genetic data. Where predictors 
are highly correlated, as is often the case for genetic data, OLS estimates 
can be unstable, having large variance. In a GWAS, nearby SNPs can be 
exactly correlated, leading to exact collinearity in X. Then, estimates of f3 
are not uniquely defined because the matrix X'X is singular and cannot be 
inverted. When the number of predictors, p, exceeds the number of obser- 
vations, n, again OLS estimates of /3 are not uniquely defined. A number 
of penalised regression approaches have been proposed to address this prob- 
lem. Two of these are described here: ridge regression (RR) and principal 
components regression (PGR). 

In both RR and PGR, as in ordinary least squares regression (OLSR), 
the fitted values of Y can be expressed as 

(2.3) Y = HY 

Here, H is the 'hat matrix', or projection matrix, and it relates the fitted 
values Y to the observed values Y. In OLSR, RR and PGR, H is of the 
form 

(2.4) H = XGX' 
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The definition of G depends on tlie model being fitted. In OLSR, G = 
(X'X)"\ (see Equation (2.2)) and Y = X(X'X)"^X'Y = X/3. Both RR 
and PGR use G that approximates (X'X)~^ in a different way [Brown 
(1993)]. 

In RR, G = (X'X + klp)~^ where k is the ridge parameter and Ip is the 
p-dimensional identity matrix. 

In PGR, G is formed by taking the eigendecomposition of X'X: 

(2.5) X'X = QAQ' 

Golumns qj of Q form the eigenvectors of X'X and A = diag (Ai > A2 > • • • > Ap_i > Ap) 
is the diagonal matrix with diagonal elements the eigenvalues of X'X in de- 
scending order. Then, G for a PGR using the first r components is given 
by 

r 

(2.6) G = ^(l/A,)g,g;. 

i=i 

2.3. Ridge regression estimators. As discussed in Section 1.2, a number 
of methods have been proposed in the literature to estimate k based on 
the data. One popular estimator, proposed by Hoerl, Kennard and Baldwin 
(1975), is easier to understand when the linear regression model in Equation 
(2.1) is written in canonical form: 

With columns of X centred and scaled such that X'X is in correlation 
form, the eigendecomposition of X'X (Equation (2.5)) yields a matrix Q 
with columns qj the eigenvectors of X'X and the diagonal elements Xj of A 
the corresponding eigenvalues. 

Taking Z = XQ and cx = Q'/9, the model in Equation (2.1) can be 
rewritten in canonical form as 

(2.7) Y = Za + e 

Golumns of Z are the t = in.m{n,p) principal components of X. The OLS 
estimator for a is 

(2.8) a = A^^Z'Y 
and this is related to as 

(2.9) a = Q'/3 

In PGR, a are the regression coefficients of a subset of the PGs that form 
the columns of Z. In RR, all coefficients are used. With ridge parameter 
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k, estimates of RR coefficients are given, in canonical form, by 

(2.10) = 

These estimated coefficients can be returned to the original form of the 
model using the transformation in Equation (2.9). 

Hoerl, Kennard and Baldwin (1975) propose estimating k from the data 

as: 

(2.11) kuKB = ^ = ^ 

a " (3(3 

This ridge estimator is motivated as the harmonic mean of the 'ideal' 
generalized ridge estimator in terms of minimizging MSE. a in this estima- 
tor are the PGR coefficients, and are estimated as in equation (2.8). is 
estimated by 

(2.12) = (Y-Xa)-(Y-Xa) 

n — p 

We also investigated the performance of ridge estimators based on that 
proposed by Lawless and Wang (1976). However we found that these per- 
formed less well than estimators based on the estimator of Hoerl, Kennard and Baldwin 
(1975). 

As mentioned in Section 1.2, the ridge estimator fcnKB (Equation (2.11)) 
is not useful when the data comprise more predictors than observations, as 
in such situations OLS estimates of the parameters cr^ and /3 do not exist, 
resulting in /chkb being undefined. However, with the model in canonical 
form, kuKB is written as: 

" 2 

(2.13) fcHKB ^'^ 



a'a 



■? (Y - Za)'(Y - Zq) 
(2.14) = ^ ' ^ ' 



n — p 

We consider a regression situation in which most of the variance in X is 
explained by its first few PCs. Then, we propose a ridge penalty that uses 
the harmonic mean of the 'ideal' penalties of only these ffist r components, 
resulting in a ridge penalty of the form: 

(2.15) kr = ^^-^2' 
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Here 



(2.16) 




(y - Zrar)'(y - ZrOr) 



n — r 



and Zr and fir contain the first r columns and the first r elements of Z and 
ol respectively. Before can be used in practise, we must decide the number 
of PCs, r, to use in its calculation. 

As a starting point, we address whether it is more useful to include all 
non-zero PCs (of which there are at most min(n,p)), or to include fewer 
than all the non-zero PCs. To this end we reanalyse the data analysed by 
Hoerl, Kennard and Baldwin (1975), extending their results to compare the 
shrinkage parameter kr to A:hkb • The data being reanalysed are a ten-factor 
dataset consisting of 36 observations. These data were first discussed by 
Gorman and Toman (1966) and are described in Daniel, Wood and Gorman 
(1999). They relate to describing the operation of a petroleum refining unit. 
Following the approach taken by Hoerl, Kennard and Baldwin (1975), we 
use the ten-factor dataset as a design matrix in a simulation study. In each 
replicate, a vector of regression coefficients with a specified squared length is 
simulated. As in Hoerl, Kennard and Baldwin (1975) we find that, subject to 
normalisation, our results are not sensitive to this value. Response variables 
are simulated at a range of signal-to- noise ratios. For each signal-to- noise 
ratio, 1000 replicates are simulated, and results are reported as an average 
of these. Hoerl, Kennard and Baldwin (1975) tabulate the mean squared 
error under both the linear and ridge models and report the percentage of 
replicates linear regression gives rise to estimates /3 with smaller smaller 
mean squared error than ridge estimates with ^hkb defined as in 

Equation (2.11). Following this approach, in Figure 1 we plot the percentage 
of replicates that results in ridge estimates /3^^ with smaller mean squared 
error than the estimates obtained using the shrinkage parameter /crkb- From 
this figure we see that, when the signal to noise ratio is not too low, estimates 
of (5 with smaller mean squared error are obtained using k^ with r < p than 
when using /chkb- 

With evidence that using k^ with r < p as a shrinkage estimator can result 
in improved estimates of (3 compared to when /chkb is used, our method is 
naturally extensible to data with more predictors than observations {p > n). 
Whilst OLS estimators are not defined when there are more predictors than 
observations, the eigenvectors and corresponding eigenvalues of the correla- 
tion matrix can be determined, and the first r of these used to compute kr- 
Again, we need to determine whether inclusion of all nonzero PCs results in 
estimates with smallest prediction error or whether smaller prediction error 
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is obtained when using the first r PCs, r < n. In Table 2 we report aver- 
age PSE when kr is calculated using different numbers of PCs, in simulated 
genetic data with more predictors than observations. Because results are 
averaged over ten replicates, instead of reporting PSE at different values of 
r, we report PSE at different cumulative proportions of variance explained 
by the the PCs, because in different replicates the correlation structure, 
and thus the proportion of variance explained by a given number of princi- 
pal components, differs. We see that minimum PSE is obtained when fewer 
than the maximum number of PCs are used to compute kj.. 

2.4. Semi-automatic choice ofk. We investigate two methods for choos- 
ing the number of PCs, r, to use in the computation of kr- We use the cross- 
validated PRESS statistic [De lorio, Ebbles and Stephens (2007)], and we 
use the number of PCs such that the degrees of freedom for variance of the 
model is equal to the number of PCs used in its calculation. This results in 
a model with the same degrees of freedom for variance as a PCR using r 
PCs, but with the penalization shared out among all the PCs. 

We are interested in the predictive performance of a model on independent 
test data. For a regression fit which can be expressed in the form of Equation 
(2.3), PSE can be written as: 

9 tr(HH') , b'b 

(2.17) PSE = + ^ + 

n n 

where b = Y — X/9 is the bias, the distance between the fitted estimates 

and the true ones. The first term measures the (unavoidable) noise in Y, 

the second measures variance in the prediction estimates. Above, we have 

determined expressions for H in OLSR, RR and PCR. In OLSR, tr (HH') = 

p, the number of parameters in the regression fit. The aim in penalised 

regression is to reduce the variance whilst introducing a little bias, keeping 

the overall PSE lower than that of OLSR. In PCR, tr (HH') = r, the number 

of components used in the penalised regression fit. 

In RR, it is straightforward to find a shrinkage parameter k such that 

tr(HfcH^) = r where r is any specified value, by noting that tr(HfcH^) = 
x2 

^^=1 and using numerical methods to find k. Thus, we can compare 

PCR and RR in terms of prediction error when the variance, — -cr , is 
equal in both models. With Var(y) = fixed, we are only interested in the 
bias term and we can find an expression for this also, by noting that: 

(2.18) b= [X-XGX'X]/3 
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In OLSR, the estimates are unbiased (b = 0). In a simulation study with 
(3 known, we can compare the bias in RR and PGR when the variance of 
the fitted Y in each model is fixed such that tr (HH') = r. 

In PGR, the coefficients of the first r PGs are their least squares counter- 
parts; the coefficients of the remaining components are set to zero. Thus the 
bias is the difference between zero and the least squares estimate of the co- 
efficient of the r-\-l...t^^ components, where t = min(n,p) is the maximum 
number of PGs. 

In RR the bias is more 'spread out' among the t components as the least 
squares estimate of each coefficient is 'shrunk' by j^^- 

We illustrate this using a simulation study. The patterns of predictors and 
coefficients used by Zou and Hastie (2005a) are used here, to cover a range 
of parameter values and correlation structures. All scenarios comprise more 
observations than predictors; nonetheless they illustrate the bias- variance 
decomposition of the PSE. The error in Y is accounted for by other terms 
in the prediction error; therefore in calculating the bias we consider only 
the predictors and coefficients. The four regression scenarios are detailed in 
Table 1. 

In Figure 2, we plot h'h/n using b defined as in Equation (2.18) for 
r = 1, . . . ,t. We see that for regression scenarios (1), (3) and (4), the bias 
is typically lower for RR than for PGA. The only scenario in which PGA 
has lower bias than RR is scenario (2), where there is moderate correlation 
among the predictors but all the coefficients have the same effect size, a 
situation that is unlikely to arise in genetic data. We can see the smooth de- 
crease in the bias with RR whereas in PGR the bias decreases in a stepwise 
manner, with each step corresponding to the inclusion of one more compo- 
nent in the model. As r approaches t, the fitted coefficients approach their 
least squares counterparts and the bias approaches 0, its value in OLSR. 

2.5. Degrees of Freedom: other definitions. OLSR, RR and PGR all re- 
sult in models of the form given in Equation (2.3). For models that can 
be expressed in this form, several definitions of effective degrees of freedom 
have been proposed [Hastie and Tibshirani (1990)]. 

The effective number of parameters, tr(H), gives an indication of the 
amount of fitting that H does. As discussed in section 2.4, tr(HH') can 
be defined as the effective degrees of freedom for variance. The degrees of 
freedom for error, as used in the denominator of the estimate of o"^ (Equation 
(2.12)) is given by ?i— tr(2H — HH'), thus the effective number of parameters 
in the degrees of freedom for error is tr(2H — HH'). 

In OLSR, RR and PGR it can be shown that tr(HH') < tr(H) < 

imsart-aoas ver. 2011/11/15 file: Cule_AQAS.tex date: March 3, 2013 



12 



E. CULE AND M. DE lORIO 



tr(2H - HH') (Hastie and Tibshirani (1990), see Appendix B). In OLSR, 
all three definitions of degrees of freedom reduce to to p, the number of pa- 
rameters in the model. In PGR, all three definitions reduce to r, the number 
of components retained in the PGR. In RR with k > 0, the three definitions 
take values that follow the above inequalities. For each of the definitions, it 
is possible to choose the ridge parameter such that the effective degrees of 
freedom equal some specified value. For a fixed value of the effective degrees 
of freedom, it can be shown that /chh' < < fc2H-HH' where /ehh' is k 
such that tr(HH') = r, kn is k such that tr(H) = r and A;2h-hh' is k such 
that tr(2H — HH') = r (for the same value of r in all three cases). The larger 
the value of k, the further the ridge estimates are from the OLS estimates. 
Thus, choosing k such that tr(HH') = r (among the three definitions of de- 
grees of freedom) results in regression coefficient estimates that are closest 
to the OLS estimates. The proof of these assertions is given in Appendix B. 

2.6. A prior on the number of components. RR has a Bayesian interpre- 
tation. RR coefficients are the maximum a posteriori estimates in a regres- 
sion setting where, using the canonical form of the linear regression model: 



(2.19) 1" ~ AA(Zq,(7^I) 

Further, we assume that elements of cx are exchangeable and have a prior 
distribution 



(2.20) aj^M{0,al) 
Then, a Bayes estimator for cx is a*, [Lawless and Wang (1976)] with 

(2.21) a* = — ^^ti,- 

If we estimate by cP' as in Equation (2.14), and set a\ equal to then 
we obtain the ridge estimator /chkb (Equation (2.11)) of Hoerl, Kennard and Baldwin 
(1975). To obtain the proposed estimator k^ (Equation (2.15)), we estimate 

cP' by al (Equation (2.16)), and set a\ equal to — ■' . 

To continue with the estimation of a from a Bayesian perspective, we can 
put a prior on the number of components, r. Then, the marginal prior on a 
becomes 



(2.22) -'A/' ( 0,^Pr(i? = r)(T, 



V 

2 

r=\ 
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We considered the effect of different prior distributions on r on the marginal 
prior on a. Simulation studies showed that the use of the proposed estima- 
tor kr with r chosen such that r = tr (H'H) results in a ridge regression 
estimator that shrinks the coefficients more than when a uniform prior on 
r, Pr {R = r) = ^.^^^ , was used in Equation (2.22) (results not shown). 

2.7. Ridge Logistic Regression. The logistic regression model is com- 
monly used to model the effect of one or more predictors when the response 
is binary. In logistic regression, similar problems arise when estimating re- 
gression coefficients when data are highly correlated or high dimensional. 
Maximum likelihood estimates may have large variance or, in cases of ex- 
act collinearity or more predictors than observations, be undefined. Ridge 
logistic regression has been considered in the literature as a means of ad- 
dressing these difficulties. Schaefer, Roi and Wolfe (1984) propose a 'Ridge 
type' estimator and demonstrate that it will result in coefficient estimates 
with smaller mean squared error than the maximum likelihood estimator 
when the predictor variables are collinear. The ridge penalty proposed by 
Schaefer, Roi and Wolfe (1984) is: 

(2.23) k = ^ 

(3(3 

Lee and Silvapulle (1988) consider this ridge estimator together with an al- 
ternative, and find the performance in terms of mean squared error depends 
on the structure of the data. Cessie and Houwelingen (1992) use generalized 
cross-validation to guide their choice of ridge parameter. 

Principal components logistic regression (PCLR, Aguilera, Escabias and Valderrama 
(2006)) has been used to overcome the problems associated with correlated 
predictors in logistic regression. In PCLR, the PCs of X are used as covari- 
ates in the logistic regression model. We use increasing numbers of PCs in 
a PCLR to compute the shrinkage parameter and degrees of freedom. For a 
PCLR using r components, the corresponding penalty is calculated as 



where Qr is the vector of r regression coefficients computed using PCLR. 

To estimate the effective degrees of freedom of the logistic regression 
model, we used the trace of the square of the hat matrix (as in linear re- 
gression). The hat matrix is calculated as: 

H = (X'WX + kiy^ X'WX 
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where W = diag(7r(xj)(l — 7r(xj))). vr(xj) are the fitted probabihties, see 
Appendix A. As in Hnear RR, we compute the effective degrees of freedom 
for variance as tr(HH'). We use simulations studies to evaluate several 
guidelines for choosing the number of components r to use in the calculation 
of kr- We apply these guidelines to a real data example. 

3. Simulation study. To evaluate the properties of the proposed shrink- 
age estimator, kr, we performed simulation studies. 

Firstly, we sought a rule for the number of PCs, r, to use to compute 
kr- Because the aim was to identify a shrinkage estimator with good pre- 
dictive performance, we compared the predictive performance of RR mod- 
els fitted using kr when r took a range of values. We also considered two 
rules that determined how many PCs to use: choosing r such that the de- 
grees of freedom for variance, tr(HH'), is equal to r (see section 2.5), and 
determining the best value of r using the cross-validated PRESS statistic 
[De lorio, Ebbles and Stephens (2007)]. The cross- validated PRESS statistic 
selects the number of PCs that minimize the in-sample prediction error. 

In the second part of our simulation studies, we used the rule from the first 
part that had best predictive performance. We then compared the predictive 
performance of RR models fitted using kr to that of competing regression 
approaches that can be applied to high-dimensional data. 

3.1. Simulated data. Because our motivation is risk prediction using ge- 
netic data, we used simulated genetic data for the simulation studies. The 
data were simulated SNP data generated using the software FREGENE 
[Hoggart et al. (2007); Chadeau-Hyam et al. (2008)] as a panmictic pop- 
ulation of 21,000 haplotypes. The simulated data are available for down- 
load from http://www.ebi.ac.uk/projects/BARGEN/. We used a region 
of approximately 7Mb, containing 20,000 SNPs with minor allele frequency 
(MAE) > 1%. 

Genotype and corresponding phenotype data were simulated with both 
continuous and binary outcomes, analysed using linear and logistic regres- 
sion respectively. Each replicate consisted of 1,000 training individuals and 
500 test individuals, and results were averaged over ten replicates. Data were 
generated and analysed as follows: 

Continuous outcomes - analysed using linear regression. 200 SNPs with 
MAF 10 - 15 % were randomly selected to be causal SNPs; these causal 
SNPs were assigned an effect size drawn from a uniform distribution 
?7[0.05, 0.1]. All other SNPs were given an effect size of 0. Thus the vec- 
tor of effect sizes, f3, of length 20,000, contained 200 non-zero elements. 
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Genotypes were generated by summing two randomly selected haplo- 
types. Responses were generated as Y = X/3 + e, e ~ M{0, 1). Model 
performance was evaluated using prediction squared error (PSE) over 
the test data: 

(3.1) PSE = -5](y.-y.) 

i=l 

where Yi is the fitted outcome of the i*^ individual. 
Binary outcomes - analysed using the logistic model. Case-control out- 
comes were generated by randomly selecting two haplotypes which 
were summed to generate a simulated genotype. The probability of an 
individual with that genotype being a case was generated as Pr (Yi = l|xj) = 
exp(— 5 + x-/3)/(l-|-exp(— 5 + X-/3)), and that individual's case-control 
status was determined randomly according to this probability. This 
process was repeated until the required (equal) number of cases and 
controls was obtained. 

For the data with binary outcomes, predictive performance was mea- 
sured using classification error, as in Cessie and Houwelingen (1992): 

r = 0,7r(x,) < 0.5|jl^, = 1,71- (x,) > 0.5 

(3.2) CE, = i i ^(xi) = 0.5 

[ 1 y* = 1, vr (x,) < 0.5| \Yi = 0, vr (x,) > 0.5 

Here, vr (xj) is the estimated probability that the i^^ individual is a 
case based on his genotypes, that is pr (yj = l|xj). In the results we 
present, we take the average CE: 

1 " 

(3.3) Average CE = - V CE^ 

n ^-^ 

i=\ 

3.2. Simulation study results. First we considered the predictive perfor- 
mance of kj. using different numbers of PCs. Because PCs explain different 
proportions of variance in different simulation replicates, instead of com- 
paring results at different values of r we compare predictive performance at 
different proportions of variance explained. Table 2 shows prediction squared 
error or classification error at different proportions of variance explained. In 
the column labelled MAX, r is the maximum number of PCs where the cor- 
responding eigenvalues are non-zero. CV is kr with r chosen using the cross- 
validated PRESS statistic, and DofF is kr chosen such that tr(HH') = r. 
From this table we see that, for both continuous and binary outcomes, the 
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best predictive performance is obtained when somewhat fewer than the max- 
imum number of PCs is used to compute kr- We see that of the two rules 
we investigated, choosing r such that tr(HH') = r offers best predictive 
performance. 

In Figure 3a we pfot the fitted RR coefficients with different values of 
r used to compute k^; Figure 3b shows the corresponding p-values. These 
plots are taken from one simulation replicate with continuous outcomes. The 
number of components chosen using tr(HH') = r, is indicated and a causal 
variant is highlighted. We see that using this rule to choose r results in a 
shrinkage parameter in a region of the ridge trace, and the corresponding 
p-value trace, where the RR coefficients and their corresponding p-values 
are stable and do not change much with further increases in the number of 
principal components. 

Having demonstrated that the rule of choosing r such that tr(HH') = r 
performs well among different ways to choose r, we compared the perfor- 
mance of RR using kr to two competing methods of fitting prediction models 
to high dimensional data. The two other methods were: 

1. Variable selection followed by multivariate regression. Univariate linear 
or logistic regression was used to estimate the strength of association 
of each predictor variable with the outcome. A proportion of the pre- 
dictor variables that were most strongly associated with the outcome 
in univariate tests were included in a multiple linear or logistic regres- 
sion model. When more than one predictor was in high LD and both 
predictors cannot be included in a multiple regression model simulta- 
neously, only one of the correlated predictors was included. Because 
the number of causal variables is not known apriori, we evaluated the 
predictive performance when a range of proportions of predictors were 
included in the multiple regression. 

2. HyperLasso regression [Hoggart et al. (2008)] is a penalized regres- 
sion method that simultaneously considers all predictor variables in 
a high-dimensional regression problem. HyperLasso was originally ap- 
plied to the problem of identifying causal variables when predictors 
are correlated, but it was shown that by using a less stringent thresh- 
old for inclusion of predictors in the model, HyperLasso could be used 
to address the problem of prediction. In order to obtain good predic- 
tive performance, it is necessary to relax the penalty so that sufficient 
coefficients are estimated as non-zero. The penalty in HyperLasso re- 
gression is such that, among a group of correlated predictors, only 
one coefficient will be estimated as non-zero. This is a disadvantage in 
prediction using genetic data, where several correlated predictors may 
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contain information that is useful for prediction even if if they are not 
all causal variables. 

HyperLasso requires the specification of two parameters to control 
the amount of shrinkage. Following Eleftherohorinou et al. (2009) the 
shape parameter in HyperLasso regression was fixed to 3.5 and the 
penalty parameter was chosen using ten-fold cross-validation. 

In Table 3 we report the results, comparing mean prediction squared error 
or mean classification error using these three different regression approaches. 
We see that RR outperforms the competing methods for this regression prob- 
lem, which is a realistic simulation of the problem of risk prediction in genetic 
data. Using univariate variable selection followed by multiple regression, the 
best performance was obtained when the number of predictors included in 
the model was equal to the number of non-zero regression coefficients when 
the data were generated, a proportion that would not be known in practise. 
In HyperLasso regression, the cross-validation required to choose the penalty 
parameter was computationally demanding, and we found in order to obtain 
the best predictive performance a relaxed penalty was necessary. RR had 
the advantage of the rule to guide the choice of kr being computationally 
more straightforward as well as offering better predictive performance. 

4. Application to Bipolar Disorder Data. A performance compar- 
ison similar to that described in the previous section was performed using 
real SNP data taken from two GWAS of Bipolar Disorder (BD). BD is a 
complex neurobehavioral phenotype, characterised by episodes of mania and 
depression. The lifetime prevalence of BD is estimated to be in the region 
of 1 %. The heritability of BD has been estimated to be as high as 85 % 
[McGuffin et al. (2003)]. A number of loci have been identified as associ- 
ated with BD; however replication studies have not always been successful 
[Alsabban, Rivera and McGuffin (2011)]. It is thought that many genes of 
small effect contribute to the liability to develop BD. This hypothesis has 
been offered as an explanation for the underwhelming findings from BD 
GWAS [Serretti and Mandelh (2008)]. 

We use data from the WTCCC [WTCCC (2007)] to construct models 
to predict BD status. We then evaluate these models using an indepen- 
dent test data set taken from the Genetic Association Information Network 
[Smith et al. (2009)]. 

Before the model was fitted and evaluated, data were pre-processed and 
quality control checks were performed following the documented procedures 
accompanying each data set. Briefly, individuals and genotype calls that had 
been identified as poor quality by the WTCCC were removed from the data. 

imsart-aoas ver. 2011/11/15 file: Cule_AQAS.tex date: March 3, 2013 



18 



E. CULE AND M. DE lORIO 



Missing genotypes were imputed using Impute2 [Howie, Donnelly and Marchini 
(2009)], with genotypes with the highest posterior probability being used in 
the analysis. For the GAIN data, only individuals with European ances- 
try and unambiguous phenotype were used in the test data. Pre-imputation 
quality control involved removing one of each of pairs of individuals iden- 
tified as related in the data, removing invariant SNPs, SNPs with call rate 
< 98%, and SNPs with Hardy- Weinberg value < le~^. Individuals iden- 
tified as outliers by the program EIGENSTRAT [Price et al. (2006)] were 
removed from their respective datasets. 

In order to evaluate the predictive models, it was necessary to have the 
same predictors (SNPs) in both the training and test data sets. Approxi- 
mately 300,000 autosomal SNPs that were common to both datasets were 
used in the analysis. PLINK vl.07 [http : //pngu.mgh. harvard, edu/ -pur cell/pl ink/, 
Purcell et al. (2007)] was used for pre-imputation quality control and data 
preparation steps. 

Having obtained directly typed and imputed SNPs such that we had the 
same SNPs in the two datasets, predictive models were fitted. In these data 
with a binary outcome, the logistic model was used to describe the relation- 
ship between genotypes and disease status. 

In performing variable selection followed by multiple regression, instead 
of including a pre-defined proportion of all predictor variables in the mul- 
tiple regression model, we chose a significance threshold (p- value cutoff) 
for a variable to be included. We evaluated predictive performance at a 
range of p-value thresholds. In HyperLasso regression, as before, we fixed 
the shape parameter as 3.5 and chose the penalty parameter using tenfold 
cross-validation. 

Regression coefficients were estimated using RR, with shrinkage parame- 
ter kr as in Equation (2.15). Results are presented with r chosen such that 
the degrees of freedom for variance of the resultant model is equal to the 
number of components used in the computation of kr. In order to prevent 
local regions of high linkage disequilibrium (LD) overwhelming the principal 
components, the training data were thinned to 1 SNP every lOOkb before 
choosing the number of principal components and computing k^. This thin- 
ning of the data was evaluated in the simulation studies in the previous 
section, but thinning did not affect predictive performance in that case (re- 
sults not shown). Fitted coefficients were subsequently estimated on the full 
set of SNPs. 

Results comparing the predictive performance of our proposed method 
with that of models based on univariate tests of significance and models 
fitted using HyperLasso regression are presented in Table 4. In models fitted 
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using univariate variable selection followed by multiple regression, relaxing 
the significance threshold for inclusion of a SNP in the model quickly lead to 
more SNPs reaching the threshold than there are observations in the data. 
With more predictors than observations, a multiple regression model can- 
not be fitted. Thus when using the univariate variable selection approach, we 
necessarily discard information in the large number of SNPs that are moder- 
ately associated with outcome. HyperLasso regression presents the problem 
of the choice of the two penalty parameters which control the amount of 
shrinkage. Choice of the parameters by cross-validation is computationally 
intensive, becoming unfeasibly so for large data sets such as this one. Our 
method has the advantage of not requiring cross-validation and offering im- 
proved predictive performance. Again we see that our proposed estimator 
offers good predictive performance in comparison to other regression ap- 
proaches as well as having computational advantages. 

5. Computational Issues. We have developed an R package, ridge, 
which implements our method. This package addresses the computational 
challenges that arise when fitting RR models to high-dimensional data such 
as genome-wide SNP data. For data sets that are too large to read into R, 
code written in C is provided and the corresponding R functions take file 
names as arguments. This circumvents the need to read large datasets into 
the R workspace whilst retaining a user-friendly interface to the code. 

Our method requires a number of matrix operations and linear algebra 
functions which can be computationally costly for large matrices. When a 
compatible graphical processing unit (GPU) is available, our software makes 
use of it, dramatically reducing computational time. NVIDIA CUDA Archi- 
tecture [NVIDIA (2011)] is used to access the GPU and CULA [Humphrey et al. 
(2010)] is used for the matrix operations and linear algebra functions. 

Logistic RR is performed using the CLG algorithm [Genkin, Lewis and Madigan 
(2007)]. CLG is a cyclic coordinate descent algorithm which updates each 
coefficient in turn, holding the other coefficients fixed until convergence is 
reached. This removes the need to repeatedly manipulate an entire data 
matrix at once and makes logistic RR feasible even when the data contain 
hundreds of thousands of predictor variables. For details of the CLG algo- 
rithm, see Appendix A. 

6. Discussion. In this paper we have introduced a semi-automatic method 
to guide the choice of shrinkage parameter in ridge regression. Our method 

is particularly useful when the regression problem comprises more predictor 
variables than observations, a situation that often arises in genetic data. This 
is because existing ridge estimators, such as that proposed by Hoerl, Kennard and Baldwin 
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(1975) cannot be computed in such settings. 

Disease risk prediction using genetic information remains a challenging 
problem due to the high dimensionality and correlation structure of the 
data. RR is a technique that addresses these difficulties and has been shown 
to offer good predictive performance. Our method facilitates the application 
of RR to genetic data for the construction of prediction models. 

Our method has computational and practical advantages over competing 
methods. Because the choice of shrinkage parameter is semi-automatic, our 
method does not require computationally intensive cross-validation proce- 
dures such as those required by HyperLasso regression. Nor is determination 
of causal variables necessary, as is the case when selecting predictor variables 
to include in a multivariate model fitted using OLSR. 

Using simulation studies, we demonstrate that our method outperforms 
HyperLasso regression in terms of prediction error when data comprise more 
predictors than observations and there are many causal variables with small 
effects, a situation that is representative of genetic data. We demonstrate the 
good predictive performance our method by using data from two genome- 
wide association studies to construct and evaluate a prediction model. 

We have developed an R package [R Development Core Team (2011)], 
ridge, which implements our method. By taking file names as arguments, 
the R package can handle hundreds of thousands of predictors and thousands 
of observations. By default, ridge computes the ridge parameter using our 
proposed method, but it flexibly allows the user to specify the ridge param- 
eter or the number of components to use to compute it. Because RR is a 
regression approach, the can be extended to include additional, non-genetic 
covariates. For example, clinical information or PCs to correct for popu- 
lation structure could be included. Such covariates can be accommodated 
by our package, where they may be penalised or not penalised. It would 
be possible to extend the approach to investigate higher-order interaction 
terms, albeit at an increased computational cost. Given the large number of 
predictor variables in a GWAS, it may be necessary to perform a variable 
selection step before investigating interaction effects. 

APPENDIX A: LOGISTIC RIDGE REGRESSION BY CYCLIC 

COORDINATE DESCENT 

In this section, we describe logistic RR and cyclic coordinate descent, the 
algorithm which we use to compute logistic RR coefficients. 

In the logistic regression model, let X be an n x matrix of predictors with 
rows Xj = {xii, . . . ,Xip), as in the linear regression model (Equation (2.1)). 
Let Y = (Yi, . . . , Yn) be a vector of observed binary outcomes, Yi € {0, 1}. 
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In biomedical data, this setup is common. The outcome variable represents 
disease status with cases as 1 and controls as 0. 

The i^^ response Yi is a. Bernoulli variable with probability of success 
vr (xj). The logistic regression model relates the probability vr (xj) that the 
i*^ observation is a case to the predictor variables as 

(A.l) Pr(y, = l|x,)=vr(x,,) = ^|^ 

where (3 is a vector of parameters to be estimated. Logistic RR estimates are 
obtained by maximising the log-likelihood of the parameter vector, subject 
to a penalty term. The penalised log-likelihood to be maximised is: 

n n 

(A.2) m = ^yaog(7r(x,)) + ^^(1 - y.)log(l - 7r(x,)) - A^^^H 

i=l i=l 

The CLG algorithm [Zhang and Oles (2001)] is a cyclic coordinate descent 
algorithm for penalised logistic regression. The algorithm is described in 
detail by Genkin, Lewis and Madigan (2007). The CLG algorithm is initi- 
ated by setting all of the coefficient estimates to an initial value. Then, each 
coefficient is updated in turn whilst holding the rest fixed. This has the ad- 
vantage of avoiding the need for the inversion of large matrices. Convergence 
is checked after each round of updating all of the coefficients. 

In the CLG algorithm, cases are code as 1^ = 1 and controls as Yi = 
— 1. Finding the updated coefficient, /3^°'" that maximises the log-likelihood 
whilst keeping the other parameters fixed is equivalent to finding the z that 
minimizes 

(" \ 2 

J^/(r, + (z-/3,)x,,y,)j 

where t = and the = (3'ii-iyi are computed using the current value of (5 
and so are treated as constants, /(r) = ln(l -|- exp(— r)), and penalty terms 
not involving z are constant and therefore omitted. 

The (5^^^^ that gives the minimum value of g{-) does not have a closed form, 
so each component- wise update requires an optimization process. Zhang and Oles 
(2001) use a modification of Newton's method in computing the component- 
wise updates. The proposed updates are adaptively bounded to prevent large 
updates in regions where a quadratic is a poor approximation to the objec- 
tive. Following Genkin, Lewis and Madigan (2007) we use as the proposed 
update: 

/A4^ A _ TJl=i{xijyi)/{l + exp(r^)) - /3j/r 
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Genkin, Lewis and Madigan (2007) use 

X r ( 0.25 if \r\ < 6 

(A.5) F{r,6) = \ , otherwise 

I 2+exp(|rh5)+cxp(5-|r|) otiierwise 

but other functions can be used [Zhang and Oles (2001)]. We then apply the 
trust region restriction: 

(A.6) " = max(2|A/3j I, Aj/2) 

to give the actual update: 



-Aj if Auj < -Aj 



(A.7) APj = < Avj if - Aj < Avj < A 



Aj if Aj < Auj 



Convergence is declared when i^^^i |^'^i|)/(l+Z]"=i ^^i) < where Y17=i l^'^* 
is the sum of the changes in the linear scores once all the coefficients have 
been updated, and e is a user-specified tolerance. The CLG algorithm is 
summarized in Algorithm 1. 



Algorithm 1 CLG [Zhang and Oles (2001); Genkin, Lewis and Madigan 
(2007)] 

Initialize: pj ^ 0, Aj <— 1 for j = 1, . . . ,p; ri for i — 1, . . . ,n 
do 

for j = 1 . . .p 

compute tentative step Afj (Equation (A. 4)). 
APj <— (max()) (Limit step to trust region) 
An •(— APjXijyi, ri <^ n + An for i = 1, . . . , n 

^ /?, + A/?, 

Aj -ir- max(2jA/3j|, Aj/2) (update size of trust region) 
end 

while (Er=iiA^«i)/(i+Er=i'-o>^ 

end 
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APPENDIX B: DEFINITIONS OF DEGREES OF FREEDOM IN 
PENALISED REGRESSION MODELS 

In RR: 

(B.l) tr(HH') < tr(H) < tr(2H - HH') 



Proof. 



H = X (X'X + kl) X' 

= UDV (VD^V + kl) VD'U' 
= UDV' (V(D2 + A;)V') VD'U' 
= U [dV(d2 + A:)] U' 



tr(H) is the sum of the t diagonal elements of H. Each element is less than 
or equal to 1. tr(HH') is also the sum of t diagonal elements, this time of 
HH', and each of which is the square of the corresponding diagonal element 
of H. These diagonal elements each take a value between and 1, thus the 
sum of their squares is less than or equal to the sum of the original elements. 
A similar argument holds for the diagonal elements of 2H — HH', where the 
sum is greater than or equal to the sum of the diagonal elements of H: 

t 

trace(H) = ^ + k) t = mm{n,p) 

i=i 
t 

trace(HH') = ^j/^^] + '^)^ 
j=i 

i 

trace(2H - HH') = Yxj{X^. + 2k) /{X] + kf 

□ 

Corollary: For a fixed value of the degrees of freedom, kuw < < 
^2H-HH' where /chh' is k such that tr(HH') = r, kn is k such that tr(H) = r 
and A:2H-HH' is k such that tr(2H — HH') = r (for the same value of r in 
all three cases). 
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Proof. We seek kn and /chh' such that: 



t 



= r 



For each diagonal element j = 1 



t: 



A'(a2 + A:hh')' = A^A2 + A:h) 



Which simplifies to 

(2 + ^)fcHH' = ku 

(2 + ^) > so /CH > ^HH' 

An analogous argument shows that A:2H-HH' > ^H- 



□ 



The larger the value of k, the further the ridge estimates are from the OLS 
estimates, as expressed in Equation (2.10). This relationship holds when the 
ridge estimates are returned to the orientation of the data by the relationship 
in Equation (2.9). 

In RR with k > 0, the three definitions of degrees of freedom follow the 
inequalities given in (B.l). For each of the definitions, it is possible to choose 
the ridge parameter such that the degrees of freedom equal some specified 
value. Thus, choosing k such that tr(HH') = r (among the three definitions 
of degrees of freedom) results in regression coefficient estimates that are 
closest to the OLS estimates. 
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APPENDIX C: TABLES AND FIGURES 

Table 1 

Four simulation scenarios used in the evaluation of the bias-variance decomposition. The 
simulation scenarios are taken from Zou and Hastie (2005b) 



scenario 


n p 


/3 






Structure of X 


(1) 


100 8 


(3,1.5,0,0,2,0,0,0) 






corr (z, j) = 0.5' 


(2) 


100 8 


0.85 for all j 






corr (i, j) = 0.5'*"^' 


(3) 


50 40 


^'-{l i = (ll,. 


.,10,21,.. 
..,20,31,. 


.,30) 
..,40) 


corr = 0.5 for all i and j 



X, = Zi+eJ, Zi~Ar(0,l) j = l,...,5 

40 /?-/° j = (l,...,15) X, =Z2+6j, Z2^M{0,1) j=6,...,10 

W ^ j = (16,...,40) x, = Z3 + 6j, Z-,^X{0,1) i = ll,...,15 

X, ~AA(0,1) i = 16,...,40 



Table 2 

Prediction squared error using r with different proportions of variance explained. 

Proportion of variance explained (%) RR parameter 



Binary outcomes (mean CE) 



10 


50 


70 


90 


MAX 


CV 


DofF 


1.24 


1.23 


1.23 


1.27 


3.20 


1.24 


1.23 


0.46 


0.47 


0.47 


0.47 


0.47 


0.47 


0.46 



Table 3 
Prediction error - simulated data 







Univariate 






HLasso 


RR 


% of SNPs ranked by univariate p-value 


0.1% 


0.5% 


1 % 


3% 


4% 






Continuous outcomes (mean PSE) 


1.51 


1.55 


1.54 


2.21 


3.93 


2.41 


1.23 


Binary outcomes (mean CE) 


0.49 


0.48 


0.48 


0.49 


0.50 


0.50 


0.46 



In Tables 2 and 3, CV refers to RR with the number of PCs used to 
compute kr chosen by cross-vahdation, and DofF refers to RR with the 
number of PCs used to compute kr chosen using tr(HH') = r. In Table 
3, in HyperLasso regression, the shape parameter was fixed to 3.5 and the 
penalty parameter was chosen using tenfold cross-validation, and in Ridge 
Regression the shrinkage parameter kr was used. 
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Table 4 

Prediction error - Bipolar Disorder data. Logistic RR models were fitted on WTCCC-BD 
data. Results are reported as mean classification error over the GAIN-ED data. 

Univariate HyperLasso Ridge Regression 

p-value threshold 10"^ 10"'' 10"^° 
Mean classification error 0.489 0.491 0.490 0.492 0.465 
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Fig 1. Comparing the mean-squared error of ridge regression estimates obtained using the 
shrinkage parameter k,. to those obtained using the shrinkage parameter kuKB- Plotted are 
the proportion of replicates that using kr results in larger mean squared error than the 
estimates fitted using knKB (equivalent to kr with r = p). 
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Bias with increasing r Bias with increasing r 




12345678 12345678 



Bias with increasing r Bias with increasing r 




10 20 30 40 10 20 30 40 



Fig 2. Bias m PGR and RR in regression scenarios (1), (2), (3) and (4) (Table 

1 ), at different values of r . 
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Fig 3. a. Ridge trace showing estimated regression coefficients and b. p-value trace of RR 
coefficients estimated using kr computed using different numbers of PCs. The x-axis shows 
the number of PCs used to compute kr . The vertical dotted line indicates that our proposed 
method of choosing the number of components, r, chooses a ridge parameter in the region 
where ridge estimates and their corresponding p-values stabilise. The black line indicates 
a causal variant. Plotted are the first 100 SNPs of the 20,000 m one simulation replicate, 
with continuous outcomes. 
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